(57) In a superscalar computer system, a plurality of
instructions are executed concurrently. The instructions 
being executed access data stored at addresses of the 
superscalar computer system. An instruction generator, 
such as a compiler, partitions the instructions into a plu- 
rality of sets. The plurality of sets are disjoint according 
to the addresses of the data to be accessed by the
instructions while executing in the superscalar computer
system. The system includes a plurality of clusters for
executing the instructions. There is one cluster for each 
one of the plurality of sets of instructions. Each set of 
instructions is distributed to the plurality of clusters so 
that the addresses of the data accessed by the instruc- 
tions are substantially disjoint among the clusters while 
immediately executing the instructions. This partitioning 
and distributing minimizes the number of interconnects 
between the clusters of the superscalar computer. 
Description

FIELD OF THE INVENTION 

This invention relates generally to computer systems,
and more particularly to processors which can issue
multiple instructions during each processor cycle.

BACKGROUND OF THE INVENTION 

In order to achieve higher performance, modern 
computer systems are beginning to issue more than one 
instruction for each processor clock cycle. Each instruc- 
tion includes a single operation code (opcode) specify- 
ing its function, as well as one or more operands for 
specifying addresses of data. The data addresses can 
be memory addresses or register addresses. Comput- 
ers that can issue more than one instruction for each 
clock cycle are called superscalar computers. 

Traditionally, because of the complexity of super- 
scalar computers, the number of instructions which can 
be issued per processor cycle has been relatively small, 
e.g., two to four instructions per cycle. Furthermore, the
number of different types or classes of instructions 
which can be executed concurrently may be limited. By 
way of example, a triple-issue processor might be able 
to concurrently issue an arithmetic instruction, a memory
reference instruction, and a branch instruction. However,
the traditional superscalar processor cannot concurrently
issue three memory reference instructions.

Each instruction may include source and destina- 
tion operands. The operands can specify addresses of 
data manipulated by the instructions. While executing, 
the data are stored in high-speed registers that are part 
of the processor. Usually, registers that have a common 
architecture are organized into sets of registers, known 
as register files. 

A processor may be equipped with separate float- 
ing-point and fixed-point or integer register files. Ports 
are used to read and write the register files. By restrict- 
ing the number and type of instructions which can con- 
currently issue, the access paths or "ports" of registers 
can be simplified. For example, if only one fixed-point 
arithmetic instruction and only one fixed-point load/store
instruction can issue concurrently, at most, three read 
or output ports, and two write or input ports are required 
to access the fixed-point registers. 

As superscalar processors are designed with larger 
issue widths, more ports to the register files may be re- 
quired. Increasing the number of ports consumes sur- 
face area of the semiconductor die used for the circuits 
of the processor. The number of circuits can increase
worse than linearly when the number of ports is increased.
In addition, as the number of ports is increased, access
latencies can also increase.

One approach to avoiding the disadvantages of a
large multi-ported register file would be to have multiple
copies of the various register files, one copy for each
possible data path. Then, the number of read (output) ports
required for each register file can be reduced. However,
having multiple copies of the register files increases the
complexity of write accesses. Data stored in one copy
of the register file must be duplicated in other copies of
the register file. This means additional write (input)
ports, and hence, the total number of ports is increased.
Also, with duplicate register files the chip area must
increase.

Therefore, it is desired to have means and methods
which increase the number of instructions concurrently
issued by a superscalar processor without substantially
increasing the complexity of interconnects of the registers
used to store data manipulated by the executing
instructions.

SUMMARY OF THE INVENTION 

Disclosed is a method and apparatus for dynamically
scheduling instructions to multiple execution units
of a superscalar processor. The apparatus, using "hints"
provided during the generation of the instructions,
schedules instructions so that the performance of the
processor is increased. In the superscalar computer
system, a plurality of instructions are executed concurrently.
The instructions being executed access data
stored at addresses of sets of registers of the superscalar
computer system.

The invention, in its broad form, resides in a superscalar
processor as recited in claim 1, and a method for
executing instructions in a superscalar processor as recited
in claim 11. In a preferred embodiment, an instruction
generator, such as a compiler, partitions the instructions
into a plurality of sets of instructions. The plurality
of sets of instructions are substantially disjoint according
to the addresses of the data to be accessed by the
instructions while executing in the superscalar computer
system.

In a second embodiment of the invention, the superscalar
system includes a plurality of execution clusters
for executing the instructions. There is one cluster
associated with each one of the plurality of sets of registers.
The "cluster" is physically organized around a set
of registers to decrease the length of the wiring runs.
Each cluster includes a plurality of execution units, a
register renaming unit, a dispatch buffer, and an instruction
scheduler. The physical addresses of the sets of
registers are also substantially disjoint among the clusters.

As described herein, during operation of the apparatus,
the sets of instructions are distributed to the plurality
of clusters so that the addresses of the data accessed
by the operands of the instructions are substantially
disjoint among the plurality of clusters while immediately
executing the instructions. This partitioning and
distributing of the instructions increases the number of
instructions which can concurrently be issued by a superscalar
processor without substantially increasing the
complexity of interconnects of the registers used to store
data manipulated by the executing instructions.

BRIEF DESCRIPTION OF THE DRAWINGS 

A more detailed understanding of the invention can 
be had from the following description of preferred em- 
bodiments, given by way of example and to be under- 
stood in conjunction with the accompanying drawing 
wherein: 

Figure 1 is a top-level block diagram of a computer
system including a processor designed incorporating
the principles of the invention;

Figure 2A is a high level block diagram of the processor
of Figure 1;

Figure 2B is a detailed block diagram of the processor
of Figure 1; and

Figure 3 is a portion of a program including instructions
to be scheduled for multiple execution clusters.

DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENT 

Now turning to Figure 1, there is shown a computer
system 100 including a central processor unit (CPU)
200, a memory 120, a disk 130, and an input/output (I/O)
interface 140 connected to each other by a communications
bus 150.

The CPU 200 is of the type which can concurrently 
issue multiple instructions during a single processor ex- 
ecution cycle. Such processors are generally known as 
superscalar processors. The CPU 200 can include on- 
chip caches, registers, and execution units. The caches 
can include separate instruction and data caches. The 
execution units can execute instructions of different 
types. For example, the units can access data 122 of
the memory 120, e.g., load and store instructions of a
program 121, perform arithmetic operations, for example,
add and multiply instructions, and control execution
flow, e.g., branch, jump, and call instructions.

The registers can be general purpose, or dedicated 
to storing the data 122 formatted according to opera- 
tions performed on the data. For example, the registers 
can include sets of registers, e.g., register "files," spe- 
cifically designed to store floating-point, or fixed-point 
data. Certain registers may always store predetermined 
values, for example, zero and one, which are frequently 
used. Other registers, such as stack pointers, may have 
specialized functions. 

The memory 120 can be made of, for example, semiconductor
circuits which can be accessed randomly by
addresses. The memory 120 can be used to store signals
representing instructions of software programs 121
and data 122 which are processed by the CPU 200. The
software programs can be operating system programs
and application programs. The programs can also include
means for generating machine executable instructions
such as text editors, compilers, assemblers,
linkers, and so forth. The instructions 121 and the data
122 can also be generated by other computer systems.

In the preferred embodiment of the invention, as explained
in more detail below, the instructions 121 are
generated as substantially disjoint sets. For example,
the instructions 121 are partitioned among the sets according
to addresses of the data 122 accessed by the
operands of the instructions 121 while executing. In the
CPU 200, the data being immediately manipulated by
the executing instructions 121 are stored in the registers
of the CPU 200. The registers are addressed by register
operands of the instructions. Therefore, the disjoint partitioning
in this implementation is based on "names" of
the registers.

The disk 130 can be used to persistently store the
instructions 121 and the data 122 on magnetic or optical
media while the computer system 100 is operating, or
not. The instructions 121 and data 122 can be part of
larger software systems 131 and databases 132
sourced via the I/O interface 140.

The I/O interface 140 can be used to communicate
instructions and data with users 141, other peripheral
components, and other computer systems in a distributed
network of computers. The system bus 150 is used
to transport timing, control, address, and data signals
during operation of the system 100.

During operation of the computer system 100, the
instructions 121 and the data 122 are, typically, first
fetched from the disk 130 or the I/O 140. The instructions
121, while being executed, manipulate the data 122.
Each instruction usually includes an operator code (opcode),
and one or more operands. The opcodes tell the
processor circuits how to manipulate the data stored at
the addresses specified in the operands.

The instructions 121 and the data 122 are first
stored in the caches of the CPU 200 while they are being
processed. During immediate processing of the opcodes
by execution units of the CPU 200, the data 122
are stored in the registers addressed by the register operands.
Processed data can be transported back, via
the bus 150, to the memory 120 or the disk 130 for storage,
and to the I/O interface for further communication.

Figures 2A and 2B show an arrangement of the
CPU 200 according to the preferred embodiment. Figure
2A shows an instruction generator 199 and a data
generator 198 respectively generating the instructions
121 and the data 122. The generators 198 and 199 can
be software programs or, in real-time systems, the generators
may be implemented as specialized processors
or other hardware circuits. In Figure 2B, connections
which carry data signals are indicated as solid lines, and
connections for control signals are shown as broken
lines.

The CPU 200 includes an instruction cache (I-cache)
201 and a data cache (D-cache) 202. The caches
201-202 are connected to the memory 120 by the
bus 150. A branch prediction unit (BPU) 203 and an instruction
distribution unit (IDU) 204 are connected to the
I-cache 201. Outputs of the IDU 204 are connected to
two execution clusters 280 and 290, detailed below. The
outputs of the clusters 280 and 290 are connected to the
D-cache 202.

During operation of the CPU 200, the instructions 
121 are fetched from the memory 120 via the bus 150 
and stored in the I-cache 201. The order in which the
instructions 121 are fetched is determined in part by the
BPU 203. This means that the instructions 121 are
fetched dependent on a predicted behavior of the execution
flow based on previously executed instructions.

As shown in Figures 2A and 2B, the IDU 204 can
concurrently distribute multiple, e.g., eight, instructions
to the clusters of the CPU 200, four to each one of the
execution clusters 280 and 290. In the preferred embodiment
of the invention, the instructions 121 are distributed
to the clusters so that the register addresses of the
data 122 accessed by the instructions 121 are substantially
disjoint between the clusters 280 and 290.

The IDU 204 includes a distribution buffer 205 to
store instructions that are being distributed to the execution
clusters 280 and 290. As the instructions 121 are
fetched from the instruction cache 201, the IDU 204 assigns
each instruction a unique serial number. These serial
numbers can be thought of as always increasing.
During operation, there will be gaps in extant serial numbers
of instructions in progress due to the flushing of
instructions on a branch mis-prediction, and other circumstances.

Therefore, the range of the serial numbers needs 
to be larger than the maximum number of instructions 
which can be pending at any one time. Similarly, it is 
convenient for control purposes to have the range of 
possibly extant serial numbers be large. A large range 
of serial numbers simplifies the computation of a relative 
age of pending instructions. In an actual implementation,
the number of bits used to store the serial number needs
only to be sufficiently large to represent several times
more instructions than the maximum possible number 
of instructions in progress in the processor 200 at any 
point in time. 
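
The relative-age computation can be illustrated with a minimal sketch, assuming the serial numbers are kept in a fixed-width counter that wraps around. The 8-bit width, the limit of 64 instructions in flight, and the function names are assumptions made for illustration only; the specification does not give these values.

```python
SERIAL_BITS = 8                 # assumed counter width
MODULUS = 1 << SERIAL_BITS      # 256 distinct serial numbers
MAX_IN_FLIGHT = 64              # assumed maximum number of pending instructions


def next_serial(serial: int) -> int:
    """Return the serial number that follows `serial`, wrapping at MODULUS."""
    return (serial + 1) % MODULUS


def is_older(a: int, b: int) -> bool:
    """Return True if serial `a` was assigned before serial `b`.

    Valid only while the two serials are fewer than MODULUS / 2 apart,
    which holds when MODULUS is several times MAX_IN_FLIGHT.
    """
    distance = (b - a) % MODULUS
    return 0 < distance < MODULUS // 2
```

With these assumed values, any two pending serial numbers are fewer than 128 apart, so the comparison stays unambiguous even when the counter wraps from 255 to 0.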

In one embodiment of the invention, the instruction
distribution logic includes a plurality of autonomous instruction
"pickers", one for each cluster. The pickers inspect
the instructions stored in the distribution buffer
205, and copy the instructions to the clusters as needed.
As each picker inspects an instruction in the distribution
buffer, a bit associated with the location where
the instruction is stored is set. When all pickers have
inspected the instruction, i.e., one bit is set for each
cluster, the IDU 204 can reclaim the location of the distribution
buffer 205 to store a next fetched instruction.
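
The picker bookkeeping can be sketched as follows. This is a software model under stated assumptions, not the circuit of the embodiment: the Slot layout, the needed_by field, and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One location of the distribution buffer (hypothetical layout)."""
    instruction: str
    needed_by: set                                  # clusters that must receive a copy
    inspected: set = field(default_factory=set)     # one "bit" per picker


def pick(slot: Slot, cluster: int, cluster_queue: list) -> None:
    """Picker for `cluster` inspects the slot, copying the instruction if needed."""
    if cluster in slot.needed_by:
        cluster_queue.append(slot.instruction)
    slot.inspected.add(cluster)                     # set this picker's bit


def reclaimable(slot: Slot, num_clusters: int) -> bool:
    """The IDU may reuse the slot once every picker has set its bit."""
    return len(slot.inspected) == num_clusters
```

A slot becomes reclaimable only after every picker has looked at it, whether or not that picker copied the instruction.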

The sequencing of the instructions through the buffer
205 can be done by arranging the buffer 205 as a
connected set of shift registers. Alternatively, if the buffer
205 is arranged as a ring buffer, a head and tail pointer
can control the sequencing. If the processor 200 includes
a large number of clusters, it may be advantageous
to provide a set of broadcast busses to buffer and
distribute the instructions. In systems with a small
number of clusters, a multiported distribution buffer
would be a preferred implementation for distributing instructions
to the multiple execution clusters.

Advantageously, distributing the instructions over
multiple execution clusters makes possible the partitioning
of the register file into several smaller files, one for
each execution cluster with only a small amount of duplication
between the files. Thus, each smaller register
file can have a small number of ports resulting in lower
access latencies, while the total collection of register
files still has a high port bandwidth overall.

In a preferred embodiment of the invention, the order
in which instructions are executed in the multiple execution
clusters is chosen dynamically. Dynamic scheduling
to multiple clusters means that the decisions as to
which instructions are executed are made at run-time.
In contrast, traditional static scheduling typically determines
the order of instruction execution at the time the
instructions are generated. Static scheduling cannot
take advantage of information which is only available at
run time, e.g., cache misses, processor status states,
branch mis-prediction, etc. Therefore, dynamic scheduling
can have better performance than static scheduling.

During dynamic instruction scheduling, registers
may be allocated or "renamed." Allocating the physical
registers to the operands requires that registers specified
in the operands of instructions be treated as "virtual"
registers until the time that the instructions are ready to
execute. Instructions with virtual register names have
operands assigned to the physical registers at the time
that the instructions are issued for execution. This has
the advantage that the scheduling of instructions is not
limited by conflicts in register addresses, but depends
on true data dependencies and machine execution capabilities.
As a result, better performance can be
achieved.
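
Register renaming of this kind can be illustrated with a generic sketch. It is a textbook-style rename table with a free list, offered as an assumption about how such a unit might behave; the RenameUnit class and its method names are hypothetical, and the sketch never returns physical registers to the free list.

```python
class RenameUnit:
    """Toy register renaming unit: maps virtual names to physical registers."""

    def __init__(self, num_physical: int) -> None:
        self.free = list(range(num_physical))   # unallocated physical registers
        self.table = {}                         # virtual name -> current physical register

    def rename(self, dest: str, sources: list) -> tuple:
        """Rename one instruction; returns (physical dest, physical sources).

        Sources are read through the current mapping first, then the
        destination receives a fresh physical register, so reusing a
        virtual name creates no conflict with earlier instructions.
        """
        phys_sources = [self.table[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_sources
```

Because every write receives a new physical register, two instructions that reuse the same virtual name are ordered only by true data dependencies.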

In one embodiment of the invention, the CPU 200
includes a plurality, e.g., two or more, execution clusters
280 and 290. As an advantage of the invention, the
number of execution clusters can easily be scaled up to
further increase the number of instructions which can
concurrently be issued. As shown in Figure 2B, each of
the plurality of clusters 280 and 290 respectively comprises:
register renaming units (RRU) 210-211, instruction
dispatch buffers (IDB) 220-221, instruction scheduling
controllers (ISC) 230-231, register files/bypass circuits
(RF/BC) 240-241, a plurality of, e.g., four, execution
units (EU) 250-251, and transfer staging buffers
(TSB) 260-261.

The plurality of execution units 250 and 251 can
each include, for example, a fixed-point arithmetic unit,
a floating-point arithmetic unit, a memory access (load/store)
unit, and a branch unit. A central controller 300,
described in further detail below, synchronizes the activities
of the processor 200.

During operation, each of the clusters 280 and 290
concurrently receives instructions from the IDU 204 under
direction of the central controller 300. The distributed
instruction includes virtual register specifiers or
"names" of the operands. The names are assigned
"physical" names by the RRUs 210-211. The physical
names of the register operands, for cluster 280, correspond
to the addresses of the registers in register file
240, and for the cluster 290, to the addresses of the register
file 241.

After distribution and renaming, the instructions for
clusters 280 and 290 are respectively stored in the instruction
dispatch buffers 220 and 221. At any one time,
each IDB 220 or 221 can store up to, for example, sixteen
or thirty-two instructions. The locations of the IDBs
220-221 used for storing instructions can randomly be
addressable by the instruction scheduling controllers
230 and 231.

The ISCs 230-231 respectively dispatch the instructions,
i.e., "issue" the instructions, depending on instruction
types, e.g., arithmetic, access, branch, etc., to
the execution units 250 and 251. While processing the
instructions, the execution units 250 and 251 respectively
maintain the data 122 referenced by the operands
of the instructions 121 in the register files 240 and 241.
The bypass circuits of the register files allow the execution
units to capture data directly from the signaling
paths, as the data are being stored in the registers.
Thus, the bypass circuits can save processor cycles
when an execution unit needs data that has just been
manipulated by another execution unit. Upon a completion
of processing, the data 122 can be transferred, via
the D-cache 202 and bus 150, back to the memory 120.

The central controller 300 coordinates the functioning
of the processor 200. The controller 300 coordinates
the normal operation of the execution clusters, exception
conditions, and other unanticipated events. However,
decisions as to how the instructions are to be issued
to the execution units are delegated to the ISCs
230-231.

For reasons stated below, instructions having multiple
operands may be "cloned" and distributed to more
than one cluster. Therefore, the coordination of most operand
transfers happens as part of normal processing
without substantial intervention by the central controller
300. For example, an arithmetic operation executing in
one cluster and reading operands from a cloned instruction
in another cluster, receives the operand values
fetched by the cloned instruction.

The central controller 300 also manages the commitment
of instructions after the instructions have been
successfully executed. Each cluster maintains the serial
number of the "oldest" instruction which has not yet
completed execution. Each cycle, the central controller
300 chooses the instruction with the oldest serial
number, and broadcasts this value to all of the other
clusters.

This enables the other clusters to commit all instructions
having serial numbers up to, but not including, the
oldest serial number. Once an instruction has been committed,
the instruction, absolutely, cannot be reversed.
Thus, any temporary buffers utilized by a pending instruction
can be freed upon the commitment of the instruction.
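
The commitment rule can be sketched as follows, assuming for simplicity that serial numbers are plain integers that do not wrap. The function names and the use of a set to hold a cluster's pending serial numbers are illustrative assumptions.

```python
def commit_point(oldest_incomplete: list) -> int:
    """Central controller: choose the oldest serial reported by any cluster."""
    return min(oldest_incomplete)


def commit(pending: set, point: int) -> set:
    """Cluster: commit every pending serial strictly below the broadcast point.

    Returns the committed serials and removes them from `pending`.
    """
    done = {s for s in pending if s < point}
    pending -= done
    return done
```

For example, if two clusters report oldest incomplete serial numbers 12 and 9, the broadcast value is 9 and every cluster may commit its instructions numbered below 9.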

The central controller 300 also manages the states 
of pending instructions on a branch mis-prediction. In 
this case, the central controller 300 broadcasts the serial 
number of the first instruction which was executed in er- 
ror. The clusters, in response, delete the states corre- 
sponding to the erroneously executed instructions. For 
example, if an instruction with serial number 27 is a mis- 
predicted branch instruction, the states of all instructions 
with serial numbers greater than or equal to 27 are de- 
leted. Subsequently, the IDU 204 can fetch and distrib- 
ute instructions beginning at the correct branch target 
address, and assign serial numbers to the instructions. 
The assigned serial numbers are higher than the serial 
numbers of the incorrectly executed instructions. 
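
The recovery step can be illustrated with a short sketch that reuses the example of serial number 27. The dictionary of per-instruction states and both function names are assumptions for illustration, and serial numbers are again treated as non-wrapping integers.

```python
def flush(states: dict, first_bad: int) -> None:
    """Delete the state of every pending instruction with serial >= first_bad."""
    for serial in [s for s in states if s >= first_bad]:
        del states[serial]


def resume_serial(last_assigned: int) -> int:
    """Serial for the first refetched instruction; higher than any flushed one."""
    return last_assigned + 1
```

Each cluster applies the same deletion to its own pending state when the broadcast serial number arrives, so no cluster needs to know why the flush occurred.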

Similarly, on an exception or interrupt condition, the
central controller 300 broadcasts the serial number of
either the instruction causing the condition, or the following
instruction, depending on the condition. The clusters,
in response to the broadcast, can then delete all
states associated with pending instructions having serial
numbers not less than the broadcast serial number.
Now, instruction fetching and distribution can resume as
described for the branch mis-prediction.

According to the preferred embodiment of the in- 
vention, the manner in which the instructions are con- 
currently distributed over the execution clusters 280 and 
290 is decided by the instruction distribution unit 204 
using "hints" encoded with the instructions. The hints 
are provided by the generator 199 of the instructions 
121, for example, a compiler, an assembler, or a hardware
instruction generator.

The distribution of the instructions 121 is such that 
the amount of data to be communicated between the 
clusters 280 and 290 is reduced. Reducing the amount 
of data to be communicated can lead to a reduction in 
the number of signaling paths or ports of the register 
files 240-241 that are required for optimal performance,
which in turn reduces the complexity of the processor 
200. 

In the preferred embodiment, the instructions are
distributed to the execution clusters 280 and 290 so that
the number of intermediate transfers of signals from one
execution cluster to another is minimized. In most modern
processors, an intermediate transfer of signals
would require an additional processor cycle, thus, fewer
transfers require fewer processing cycles overall. However,
to the extent that additional transfers are not on
the critical path of the computation, additional transfers
over the minimum required may be helpful in more evenly
balancing the computation among the multiple clusters.

EP 0 767 425 A2

Figure 3 shows an example portion 310 of the instructions
121. The portion 310 includes instructions
which first load registers named R3 and R6 with values.
Then, the stored values are added to the constant value
T to produce second values in registers named R7 and
R8. The load instructions have single operands, and the
add instructions have two source and one destination
operands. Although this example only shows four instructions,
it should be understood that the invention can
also be worked with a larger number of instructions.

The processor 200, as shown in Figures 2A and 2B,
includes two distinct execution clusters 280 and 290 for
processing the instructions 121. Therefore, in a preferred
embodiment, instructions 380 referencing "odd"
registers R3 and R7 are distributed by the IDU 204 to
the first execution cluster 280, and instructions 390 referencing
"even" registers R6 and R8 are distributed by
the IDU 204 to the second signaling cluster 290.

If the full execution of each of the instructions takes 
one processor cycle, the four instructions of the program 
segment 310 can be executed in two cycles. Because 
the register addresses in the clusters 280 and 290 are 
distinct, the instructions can execute without interfer- 
ence, and no intermediate transfers of signals between 
execution clusters 280 and 290 are required. 
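
The odd/even distribution of the four-instruction example can be modeled with a minimal sketch. The textual instruction syntax, the spelling of the constant as T, and the function names are assumptions; Figure 3 itself is not reproduced here. The sketch sends instructions naming odd registers to cluster 280 and those naming even registers to cluster 290.

```python
import re

# Hypothetical textual encoding of the four instructions of portion 310.
PORTION_310 = ["LOAD R3", "LOAD R6", "ADD R7, R3, T", "ADD R8, R6, T"]


def registers(instruction: str) -> list:
    """Return the register numbers named in an instruction, e.g. R7 -> 7."""
    return [int(n) for n in re.findall(r"\bR(\d+)\b", instruction)]


def distribute_by_parity(program: list) -> dict:
    """Send odd-register instructions to cluster 280, even ones to cluster 290.

    An instruction naming both odd and even registers is copied to both.
    """
    clusters = {280: [], 290: []}
    for instruction in program:
        parities = {r % 2 for r in registers(instruction)}
        for parity in sorted(parities, reverse=True):   # odd first, then even
            clusters[280 if parity == 1 else 290].append(instruction)
    return clusters
```

Each cluster receives two instructions with no register shared between the clusters, which is why the segment can complete in two cycles when each instruction takes one cycle.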

In reality, the instructions 121 typically depend on 
at least one, and often, more than one operand. Good 
scheduling of instructions to the signaling clusters 280 
and 290 to achieve minimum execution time is a difficult 
problem. Perfect scheduling that results in a minimum 
execution time is intractable. The invention provides a 
solution to this problem that achieves good performance 
using straightforward and efficient circuits.

The invention proposes that a solution to the prob- 
lem is partially provided, in an upward extendible way, 
by hints provided with the register operands of the in- 
structions 121. The hints are supplied by the generator 
199 of the instructions 121. In general, depending on 
the number of parallel execution clusters which comprise
the processor, the generator 199 partitions the instructions
into a like number of sets. If the generator 199
is a compiler, then the partitioning can be by the virtual 
naming of the register operands. By virtually naming the 
register operands during instruction generation, physi- 
cal register assignment or allocation can be performed 
dynamically when the instructions are issued for execu- 
tion. 

For example, if there are two execution clusters, the 
instructions are partitioned into two sets. The instruc- 
tions are assigned to the sets so that the virtual names 
of registers specified in operands of the instructions are 
substantially disjoint. This means that the virtual names 
used in the operands of the two sets of instructions are 
mostly non-overlapping. If the register naming is substantially
disjoint, then there is minimal need for the two
register files 240-241 to communicate data with each
other while the execution units 250-251 are concurrently
executing instructions.

In other words, instructions which include register
"virtual" operands which are even, e.g., R6 and R8, are
assigned to the first execution cluster 280 by the IDU
204. Odd register operands, such as R3 and R7, are
assigned to the second execution cluster 290 by the IDU
204.

Alternatively, the virtual naming of the registers, and
the distribution of the instructions 121 among the execution
clusters can be by range of addresses of the registers.
Take, for example, a processor equipped with
thirty-two fixed and floating point registers, e.g., R0,
R1, ..., R31, and RF0-RF31. Registers in the range of
R0-R15 are assigned to the first register file 240 and
execution cluster 280. Registers R16-R31 are assigned
to the second register file 241 and execution cluster 290.
Floating point registers, e.g., RF0-RF31, can similarly
be assigned.

Registers which store constant values, e.g., zero
and one, can be accessed from any of the signaling clusters.
It may be beneficial to have several registers appear in 
all of the clusters, with parallel transfers resulting in all 
of the clusters for any write access. 

This partitioning of the instructions, clusters, and
register files can be upward extended to processors
having more than two clusters. For example, in a processor
with four execution clusters, the run time assignment
of the thirty-two registers can be in groups of eight.
However, as an advantage, the same program can still
execute in the processor of Figure 2, where the distribution
of the instructions, and the assignment of the registers
is on the basis of two clusters.
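
The range-based assignment can be expressed as a small sketch that works for two or four clusters alike. It assumes thirty-two registers and a cluster count that divides thirty-two evenly; the zero-based cluster index and the function name are illustrative choices.

```python
NUM_REGISTERS = 32


def cluster_of(register: int, num_clusters: int) -> int:
    """Map register number 0..31 to a cluster index by contiguous range."""
    if not 0 <= register < NUM_REGISTERS:
        raise ValueError(f"no such register: R{register}")
    group_size = NUM_REGISTERS // num_clusters   # 16 for two clusters, 8 for four
    return register // group_size
```

Under this mapping the same register numbering serves both configurations: R0-R15 fall in cluster index 0 of a two-cluster processor, and R0-R7 fall in cluster index 0 of a four-cluster processor.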

Instructions which solely use registers allocated to
one of the execution clusters are only distributed to the
cluster containing those registers. An instruction with
operands which use registers of more than one cluster
needs to be distributed to all of the clusters that contain
any of the source and destination registers of the instruction.
This allows the instruction issue and scheduling
hardware in each of the clusters to operate properly
taking into account constraints posed by distributing instructions
to multiple clusters.
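
The set of clusters that must receive a copy of an instruction can be computed as in the following sketch. It assumes the two-cluster range split of R0-R15 and R16-R31 with zero-based cluster indices; the function names are hypothetical.

```python
def cluster_of(register: int) -> int:
    """Two-cluster range split assumed for illustration: R0-R15 -> 0, R16-R31 -> 1."""
    return register // 16


def target_clusters(sources: list, dest: int) -> list:
    """Clusters that must receive a copy of the instruction."""
    return sorted({cluster_of(r) for r in [*sources, dest]})
```

An instruction whose operands all lie in one range yields a single target, while an instruction that mixes ranges yields every cluster holding one of its registers.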

In the case where the distribution of the instructions
cannot be perfectly disjoint with respect to the operand
addresses, the invention provides means for transferring
data between the plurality of clusters. In a preferred
embodiment, the transferring means includes associative
memory buffers, explained in greater detail below.

In the case where an instruction includes register
operands of more than one cluster, the instruction
should be executed in the cluster where the majority of
the operands of the instruction have their registers. This
minimizes the number of intermediate data transfers.
For example, an instruction which has two source operands
in a first cluster and a destination register in a second
cluster should be executed in the first execution
cluster, with only the result being forwarded to the destination
register of the second execution cluster.
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In th case where no clust r has a majority of the 
registers, the instruction can be execut d in any cluster 
containing one of the register operands, although exe- 
cution in the "destination" cluster would b thepr f rr d 
case. Wh n the instructions ar distributed across mul- 
tiple clust rs, any source operands pr sent in a cluster 
not executing the instruction needs to forward its results 
when available. Similarly, when an execution cluster 
produces a result destined for another cluster, the result 
needs to be forwarded when available. 
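
The choice of the executing cluster can be sketched as follows.
This is a minimal illustration, assuming that "majority" means
strictly more than half of all register operands (sources plus
destination) and that the destination cluster is chosen
whenever no such majority exists. The function names are
hypothetical.

```python
from collections import Counter


def executing_cluster(src_clusters, dst_cluster):
    # Count how many register operands live in each cluster.
    counts = Counter(src_clusters)
    counts[dst_cluster] += 1
    total = sum(counts.values())
    cluster, n = counts.most_common(1)[0]
    if 2 * n > total:          # strict majority of the operands
        return cluster
    return dst_cluster         # assumed tie-break: "destination" cluster


def transfers_needed(src_clusters, dst_cluster):
    # Operands outside the executing cluster must be forwarded.
    chosen = executing_cluster(src_clusters, dst_cluster)
    operands = list(src_clusters) + [dst_cluster]
    return sum(1 for c in operands if c != chosen)
```

With two source operands in cluster 0 and the destination in
cluster 1, the sketch selects cluster 0 and reports a single
transfer, the forwarded result.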

In a preferred embodiment, the transfer staging 
buffers 240-241 are used to forward results from one 
cluster to another. Each transfer staging buffer is con- 
figured as an associative memory. The transfer buffers 
240-241 can be associatively addressed by, for exam- 
ple, the instruction serial numbers, register addresses, 
or transaction types of the registers, e.g., result oper- 
and, or source operand. 
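
As a rough software analogy of such an associatively addressed
buffer, the sketch below stores entries tagged with a serial
number, a register address and a transaction type, and matches
on any combination of these fields. The class TransferBuffer,
its capacity parameter and its field names are assumptions made
for illustration; the hardware organization is not specified
here.

```python
class TransferBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def put(self, serial, reg, kind, value):
        # Returns False when the buffer is full (the sender must wait).
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(
            {"serial": serial, "reg": reg, "kind": kind, "value": value}
        )
        return True

    @staticmethod
    def _matches(entry, key):
        return all(entry[k] == v for k, v in key.items())

    def match(self, **key):
        # Associative lookup: compare every entry against the given tags.
        return [e for e in self.entries if self._matches(e, key)]

    def take(self, **key):
        # Like match(), but the matching entries leave the buffer.
        hits, rest = [], []
        for e in self.entries:
            (hits if self._matches(e, key) else rest).append(e)
        self.entries = rest
        return hits
```

For example, `buf.match(reg=5)` returns every pending transfer
for register 5, while `buf.take(serial=7, kind="result")`
removes and returns the result operand of instruction 7.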

To further minimize the need for transfers, some of 
the registers may automatically be updated. Registers 
that are automatically updated can be virtual registers 
which have corresponding physical registers in each of 
the execution clusters. The values stored in these "au- 
tomatic" registers are updated whenever any one of the 
clusters writes a new value to the registers. Examples 
of automatically updated registers could include stack 
pointers, or any other special purpose register generally 
used by the instructions. In the case where an instruc- 
tion writes to an automatic register, a copy of the instruc- 
tion is distributed to each cluster having a copy of the 
automatic register. 
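
The behavior of automatic registers can be illustrated with a
small sketch. It assumes, purely for illustration, that a
stack pointer named "sp" is the only automatic register and
that each cluster keeps its registers in a dictionary; Cluster,
AUTOMATIC and write_register are hypothetical names.

```python
AUTOMATIC = {"sp"}  # assumption: the stack pointer is "automatic"


class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.regs = {}


def write_register(clusters, home, reg, value):
    # A write to an automatic register is performed in every cluster
    # that has a copy; any other write stays in the home cluster.
    targets = clusters if reg in AUTOMATIC else [clusters[home]]
    for c in targets:
        c.regs[reg] = value
    return [c.cid for c in targets]
```

After a write to "sp", every cluster can read the new value
locally without a transfer between clusters.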

In order to properly recover correct data states after 
a branch mis-prediction, or an exception or interrupt 
condition, the D-cache 202 can be equipped with a store 
buffer 206. Normally, data are maintained in the store 
buffer 206 until all instructions having lower serial num- 
bers have been committed. That is, the data are not writ- 
ten to the D-cache until it is certain that any instruction 
needing to operate on the data has successfully com- 
pleted. 

For example, for a "store" instruction that determines
the destination address in one cluster, and receives the
source data from another cluster, cloned copies of the
store instruction are sent to both clusters by the
instruction distribution unit 204. The cluster determining
the destination address stores the destination address in
the store buffer 206 of the D-cache 202, at a location
corresponding, for example, to the serial number of the
store instruction.

Concurrently, the data to be stored are sent to the
D-cache 202 by the cluster generating the result data.
The data are also placed in the store buffer 206, with
corresponding address and data placed at the same
location of the store buffer, based on their common
instruction serial number. Therefore, each location of the
store buffer 206 can only have address and data from
one instruction, since each location of the buffer must
have a unique instruction serial number.



While data are "uncommitted," load requests for the
uncommitted data must be read from the store buffer
206. There can be multiple locations of the store buffer
206 corresponding to a destination address. While
by-passing the D-cache 202, a load instruction must take
the data from the location having the highest serial
number, i.e., the data generated by the most recently
executed instruction with the same destination address,
but not data stored at a location having the serial number
of a store instruction which has a higher serial number
than the load instruction.
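
The store buffer behavior described above can be sketched as
follows. This is an illustrative model only: the class
StoreBuffer and its method names are hypothetical, the D-cache
is modeled as a dictionary, and the sketch simply raises an
error where real hardware would stall the load until the store
data arrive.

```python
class StoreBuffer:
    def __init__(self):
        # One location per store, keyed by instruction serial number.
        self.slots = {}

    def _slot(self, serial):
        return self.slots.setdefault(serial, {"addr": None, "data": None})

    def set_address(self, serial, addr):
        # Sent by the cluster that computes the destination address.
        self._slot(serial)["addr"] = addr

    def set_data(self, serial, data):
        # Sent by the cluster that produces the data to be stored.
        self._slot(serial)["data"] = data

    def load(self, load_serial, addr, dcache):
        # Consider only stores older than the load with the same address.
        older = [s for s, slot in self.slots.items()
                 if s < load_serial and slot["addr"] == addr]
        if not older:
            return dcache.get(addr, 0)      # nothing uncommitted: use D-cache
        slot = self.slots[max(older)]       # highest qualifying serial number
        if slot["data"] is None:
            raise RuntimeError("store data not yet available")
        return slot["data"]

    def commit(self, serial, dcache):
        # On commit, the data finally reach the D-cache.
        slot = self.slots.pop(serial)
        dcache[slot["addr"]] = slot["data"]
```

A load with serial number 7 therefore sees the data of store 5
but not of store 9, even if both stores target the same
address.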

Because load and store instructions can operate on
data having different data widths, e.g., 1, 2, 4, or 8 bytes,
one load instruction, e.g., a load of 8 bytes, may have to
read data from several locations of the store buffer 206,
e.g., some of the data may come from locations where
uncommitted data are written, and some of the data may
come from the D-cache 202 itself.
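
One way to picture a wide load that gathers bytes from several
sources is the byte-wise merge below. It is a sketch under
assumptions: the D-cache is modeled as a dictionary from byte
address to byte value, uncommitted stores are given as
(serial, address, bytes) tuples, and load_bytes is a
hypothetical helper name.

```python
def load_bytes(load_serial, addr, width, stores, dcache):
    # Start from the committed bytes held in the D-cache.
    result = bytearray(dcache.get(addr + i, 0) for i in range(width))
    # Apply older uncommitted stores in serial order, so that the
    # most recent older store wins for each individual byte.
    for serial, s_addr, payload in sorted(stores):
        if serial >= load_serial:
            continue  # younger than the load: must not be observed
        for i, b in enumerate(payload):
            off = s_addr + i - addr
            if 0 <= off < width:
                result[off] = b
    return bytes(result)
```

For an 8-byte load, some bytes may thus come from a 4-byte
store, some from a later 2-byte store that overlaps it, and the
remainder from the D-cache.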

A load instruction having register operands in more 
than one cluster has copies of the instruction distributed 
to all of the clusters in which the load instruction has 
operands. Data to be fetched from the D-cache 202
may be preempted by data maintained in the store buffer
206, if necessary. The data are then sent to the cluster 
containing the destination operand of the load instruc- 
tion. 

When a load instruction is issued to the cluster
which will perform the source address calculation, a
signal is sent to the cluster containing the destination
operand. This signal tells the copy of the load instruction
in the "destination" cluster to issue. Because the issue
of the load instruction in the destination cluster is
delayed by a cycle with respect to the issue of the load
instruction in the source cluster, any data loaded from
the D-cache 202 can temporarily be stored in a load
staging buffer. Data in the load staging buffer may be
by-passed for use in other computations in the cluster.

In systems with a large number of clusters, there 
may be times when data required by several load in- 
structions need to be forwarded to a single destination 
cluster. In this case, the data may need to be retained 
in the load staging buffer for more than one cycle until 
a write port of the register file becomes available. Sim- 
ilarly, due to other constraints, a load instruction in the 
destination cluster may not be able to issue immediately, 
which also increases the number of cycles that the load 
data need to be retained in the load staging buffer before 
the data are written to the register file of the destination 
cluster. 
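
The effect of a limited number of register file write ports on
the load staging buffer can be illustrated with a small
cycle-by-cycle simulation. The model is an assumption made for
illustration: data are drained in arrival order, a value may be
written in the cycle it arrives if a port is free, and drain is
a hypothetical function name.

```python
from collections import deque


def drain(arrivals, write_ports=1):
    # arrivals: dict mapping cycle -> list of (register, value) pairs
    # arriving at the destination cluster in that cycle.
    # Returns a dict mapping register -> cycle in which it was written.
    staging = deque()
    written = {}
    cycle = 0
    last = max(arrivals) if arrivals else -1
    while cycle <= last or staging:
        staging.extend(arrivals.get(cycle, []))
        # At most `write_ports` values leave the staging buffer per cycle.
        for _ in range(min(write_ports, len(staging))):
            reg, _value = staging.popleft()
            written[reg] = cycle
        cycle += 1
    return written
```

With three loads arriving in the same cycle and a single write
port, the third value remains in the staging buffer for two
extra cycles.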

While specific implementations of the invention
have been described, those familiar with the art will
appreciate that the invention may be practiced in other
ways while still remaining within the scope of the
appended claims.

Claims

1. A superscalar processor, comprising:

a plurality of execution clusters, each execution
cluster including a plurality of execution units,
each one of the plurality of execution clusters
concurrently executing a plurality of instructions; and

means, connected to the plurality of execution
clusters, for concurrently distributing the plurality
of instructions to each of the plurality of execution
clusters so that addresses of data accessed by the
plurality of instructions are substantially disjoint
among the plurality of execution clusters in order to
minimize the number of interconnects and data
transfers between the plurality of execution clusters.

2. The processor of claim 1 wherein each one of the
plurality of execution clusters further comprises:

a register file, each register file including a
plurality of registers, each one of the plurality of
registers having a physical address, the plurality of
instructions accessing the data stored in the plurality
of registers by the physical addresses, the physical
addresses of the registers of the plurality of registers
files being substantially disjoint among the plurality
of execution clusters.

3. The processor of claim 2 wherein the plurality of
instructions include instructions having operators and
operands, the operands including virtual register
addresses, and

wherein each of the plurality of execution clusters
further comprises:

a register renaming unit, the register renaming
unit dynamically assigning the physical addresses
of the plurality of registers to the virtual
register addresses according to the distribution
of the plurality of instructions among the plurality
of execution clusters, wherein further, the
means for concurrently distributing further
comprises:

a distribution buffer, the distribution buffer
having a plurality of locations storing the
plurality of instructions while concurrently
distributing to the plurality of execution
clusters; and

means for assigning a unique serial
number to each of the plurality of instructions
stored in the distribution buffer, the
serial numbers assigned in the order that
the plurality of instructions are fetched from
a memory storing the plurality of instructions,
wherein further, the distribution buffer
further comprises:

a plurality of pickers, there being one
picker for each one of the plurality of
execution clusters, each one of the plurality of
pickers making a determination if a particular
instruction needs to be copied from the
distribution buffer to a corresponding one
of the plurality of execution clusters, the
particular instruction being distributed to
the corresponding one of the plurality of
execution clusters when each one of the
plurality of pickers has made the determination,
and in response to distributing the particular
instruction, the means for concurrently
distributing fetching a next instruction
from the memory.

4. The processor of claim 3 wherein the distribution
buffer further comprises:

a plurality of shift registers connected to each
other to determine an order in which the plurality of
instructions stored in the distribution buffer are
distributed to the plurality of execution clusters.

5. The processor of claim 3 wherein the distribution
buffer further comprises:

a ring buffer having a head pointer and a tail
pointer to determine an order in which the plurality
of instructions stored in the distribution buffer are
distributed to the plurality of execution clusters.

6. The processor of claim 1 further comprising:

means for fetching the plurality of instructions
to be executed in the plurality of execution clusters
dependent on a predicted execution flow of the
plurality of instructions.

7. The processor of claim 1 wherein each instruction
of the plurality of instructions include an operator,
and

wherein each of the plurality of execution clusters
further comprises:

a dispatch buffer, the dispatch buffer storing the
plurality of instructions to be issued to the
plurality of execution units of the corresponding
execution cluster; and

an instruction scheduling controller, connected
to the dispatch buffer, issuing a particular
instruction stored in the distribution buffer to a
particular one of the plurality of execution units
based on the operator of the particular instruction.

8. The processor of claim 3 further comprising:

a central controller to manage a commitment
of an executed instruction, the central controller
further comprising:

means for releasing temporary buffers utilized
by the executed instruction if the executed
instruction was successfully executed; and

means for deleting all states of the executed
instruction if the executed instruction was
unsuccessfully executed.

9. The processor of claim 1 wherein the plurality of
instructions include instructions having operators and
virtual register addresses of operands, and further
comprising:

an instruction generator assigning the virtual
register addresses to operands of the plurality of
instructions in order to maximize the number of the
plurality of instructions which are disjoint among the
plurality of execution clusters according to the
addresses of data accessed by the operands of the
plurality of instructions, wherein the virtual register
addresses are partitioned into a plurality of sets,
there being one set for each of the plurality of
execution clusters, wherein the plurality of sets further
comprises:

a first set having even virtual register addresses,
and

a second set having odd virtual register addresses.

10. The processor of claim 2 wherein the plurality of
instructions include instructions having a plurality of
operands, each operand having a virtual register
address, and

wherein a copy of a particular instruction is
copied to every one of the plurality of execution
clusters having physical addresses of registers
corresponding to any of the virtual register addresses
of the plurality of operands.

11. A method for executing instructions in a superscalar
processor, comprising the steps of:

generating a plurality of sets of instructions, the
addresses of data to be manipulated by operands
of each one of the plurality of sets of instructions
being substantially disjoint among the plurality of
sets of instructions;

concurrently distributing the plurality of sets of
instruction to a plurality of execution clusters so
that the addresses of data accessed by the plurality
of sets of instructions are substantially disjoint
among the plurality of execution clusters to
minimize the number of interconnects between the
plurality of execution clusters.